(54) Microprocessor with load/store operation to/from multiple registers



(57) A load multiple instruction may be executed in 
a superscalar microprocessor by dispatching a load
multiple instruction to a load/store unit, wherein the load/ 
store unit begins execution of a dispatched load multiple 
instruction, and wherein the load multiple instruction 
loads data from memory into a plurality of registers. A 
table is maintained that lists each register of the plurality 
of registers and that indicates when data has been load- 
ed into each register by the executing load multiple in- 
struction. An instruction is executed that is dependent 
upon source operand data loaded by the load multiple 
instruction, prior to the load multiple instruction complet- 
ing its execution, when the table indicates the source 
operand data has been loaded into the source register. 
Also, a store multiple instruction may be executed by 
dispatching a store multiple instruction to the load/store 
unit, whereupon the load/store unit begins executing the 
store multiple instruction, wherein the store multiple instruc-
tion stores data from a plurality of registers to memory.
A fixed point instruction is executed that is dependent 
upon data being stored by the store multiple instruction 
prior to the store multiple instruction completing its ex- 
ecution, but the executing fixed point instruction is pro- 
hibited from writing to a register of the plurality of regis- 
ters prior to the store multiple instruction completing. 
Description

The present invention relates in general to the allo- 
cation of resources during the execution of instructions 
in a microprocessor, and in particular to the allocation 
of resources in microprocessors which have a load/ 
store operation to multiple registers. 

Multi-register load/store instructions require complete
serialization since these instructions modify or use
up to all the general purpose registers (usually 32) contained
within a microprocessor. In the PowerPC™ line
of microprocessors produced by International Business 
Machines Corporation, integer load/store multiple in- 
structions are accommodated, which move blocks of da- 
ta to and from the microprocessor's general purpose 
registers (GPRs). The multi-register instructions provided
are the Load Multiple Word (lmw) and the Store Multiple
Word (stmw) instructions.

In the prior art, because such multi-register load/ 
store instructions may modify or use up to all the general 
purpose registers in the system, later instructions in the 
instruction sequence are held in the instruction buffer 
until the multi-register instruction is complete. There- 
fore, it was assumed that multi-register instructions 
force complete serialization of these instructions in the 
instruction stream. To implement such a serialization in 
the prior art, the multi-register instructions are allocated 
all of the required general purpose registers until the in- 
struction is completed. Such a system substantially re- 
stricts performance by holding up the instruction pipe- 
line until the multi-register instruction is completed. 

Accordingly, the invention provides a method of ex- 
ecuting multiple instructions in a superscalar microproc- 
essor, including at least one load multiple instruction that 
loads to more than one register of a plurality of registers, 
comprising the steps of: 

dispatching a load multiple instruction to a load/ 
store unit, wherein the load/store unit begins exe- 
cution of a dispatched load multiple instruction, and 
wherein the load multiple instruction loads data 
from memory into a plurality of registers; 
maintaining a table that lists each register of the plu- 
rality of registers and that indicates when data has 
been loaded into each register by the executing 
load multiple instruction; and 
executing an instruction that is dependent upon 
source operand data loaded by the load multiple in- 
struction into a register of the plurality of registers 
indicated by the instruction as a source register, pri- 
or to the load multiple instruction completing its ex- 
ecution, when the table indicates the source oper- 
and data has been loaded into the source register. 

The invention further provides a method of execut- 
ing multiple instructions in a superscalar microproces- 
sor, including at least one store multiple instruction that 
stores data from more than one register of a plurality of
registers to memory, comprising the steps of:

dispatching a store multiple instruction to a load/
store unit, whereupon the load/store unit begins
executing the store multiple instruction, wherein the
store multiple instruction stores data from a plurality
of registers to memory; and
executing a fixed point instruction that is dependent
upon data being stored by the store multiple instruction
from a register of the plurality of registers indicated
by the fixed point instruction as a source register,
prior to the store multiple instruction completing
its execution, but prohibiting the executing fixed
point instruction from writing to a register of the
plurality of registers prior to the store multiple
instruction completing.

In the preferred embodiment, said dependent instruction
does not complete until after said load or store
multiple instruction has completed.

The invention further provides a superscalar micro- 
processor having a plurality of execution units for simul- 
taneously executing a plurality of instructions and being 
connected to a memory, comprising: 

a plurality of registers for selectively storing data
resulting from execution by the execution units;
a load/store execution unit that executes load and
store instructions which load or store data to or from
the plurality of registers;
a fixed point execution unit for performing fixed
point operations on operand data stored in source
registers of the plurality of registers;
and a dispatcher that maintains a table listing at
least some of the plurality of registers and indicating
when data has been loaded into each register by a
load multiple instruction executing in the load/store
execution unit, wherein a load multiple instruction
loads data from memory into more than one of the
plurality of registers, and wherein the dispatcher
dispatches instructions to the plurality of execution
units, including the load/store execution unit and the
fixed point execution unit, and further wherein the
dispatcher dispatches a load multiple instruction to
the load/store unit, which begins execution of the
load multiple instruction and further, prior to the load
multiple instruction finishing its execution, dispatches
a fixed point instruction which is dependent upon
source operand data loaded by the load multiple
instruction into a register of the plurality of registers
indicated by the instruction as a source register,
when the table indicates the source operand data
has been loaded into the source register by the
execution of the load multiple instruction in the load/
store execution unit.

The invention further provides a superscalar micro- 
processor having a plurality of execution units for simultaneously
executing a plurality of instructions and being
connected to a memory, comprising: 

a plurality of registers for selectively storing data re- 
sulting from execution by the execution units; 
a load/store execution unit that executes load and 
store instructions which load or store data to or from 
the plurality of registers; 

a fixed point execution unit for performing fixed 
point operations on operand data stored in source 
registers of the plurality of registers; 
a dispatcher that dispatches instructions to the plu- 
rality of execution units, including the load/store ex- 
ecution unit and the fixed point execution unit, 
wherein the dispatcher dispatches a store multiple 
instruction to the load/store unit, which begins exe- 
cution of the store multiple instruction, wherein a 
store multiple instruction stores data from more 
than one of the plurality of registers to memory, and 
further, prior to the store multiple instruction finish- 
ing, dispatches a fixed point instruction, which is dependent
upon source operand data stored in a register
of the plurality of registers that is being stored
out to memory by the executing store multiple in- 
struction, to the fixed point execution unit, wherein 
the fixed point execution unit executes the dis- 
patched fixed point instruction prior to the store mul- 
tiple instruction finishing, but does not store the re- 
sult of the executed fixed point instruction to a reg- 
ister of the plurality of registers prior to the store 
multiple instruction finishing. 

In a preferred embodiment, the dispatcher in the mi- 
croprocessor is capable of handling both load and store 
operations as described above. 

The approach described above allows resources to 
be deallocated from the multi-register instructions as 
those instructions are executed so that the resources 
become available to allow subsequent instructions to 
begin execution simultaneously with the executing mul- 
ti-register instruction. Thus the early deallocation of re- 
sources dedicated to a serialized load/store multiple op- 
eration allows simultaneous execution of additional in- 
structions. This substantially improves the performance 
of a microprocessor having a superscalar design by al- 
lowing instructions that utilize multiple registers to be ex- 
ecuted in parallel with subsequent instructions that uti- 
lize the general purpose registers. 

Thus in a preferred embodiment, a load multiple instruction
may be executed in a superscalar microprocessor
by dispatching the load multiple instruction to a
load/store unit. The load/store unit begins execution of 
a dispatched load multiple instruction, and the load mul- 
tiple instruction loads data from memory into a plurality 
of registers. A table is maintained that lists each register 
of the plurality of registers and indicates when data has 
been loaded into each register by the executing load 
multiple instruction. An instruction that is dependent upon
source operand data loaded by the load multiple instruction
into a register of the plurality of registers indicated
by the instruction as a source register is executed
prior to the load multiple instruction completing its execution,
after the table indicates the source operand data
has been loaded into the source register.

Also in the preferred embodiment, a store multiple
instruction may be executed in a superscalar microprocessor
by dispatching the store multiple instruction to a
load/store unit. The load/store unit begins executing the
store multiple instruction, wherein the store multiple instruction
stores data from a plurality of registers to memory.
A fixed point instruction that is dependent upon data being
stored by the store multiple instruction from a register
of the plurality of registers indicated by the fixed point
instruction as a source register is executed prior to the
store multiple instruction completing its execution. However,
the executing fixed point instruction is prohibited
from writing to a register of the plurality of registers prior
to the store multiple instruction completing.

A preferred embodiment of the invention will now 
be described by way of example only with reference to 
the following drawings: 

Figure 1 illustrates a block diagram of a processor;
Figure 2 shows a timing diagram of the cycles required
to handle a load multiple instruction and subsequent
fixed-point instructions;
Figure 3 shows a timing diagram showing the cycles
in which a store multiple instruction and two
fixed-point instructions are handled in the microprocessor.

With reference now to the figures and in particular
with reference to Figure 1, there is illustrated a block
diagram of a processor, indicated generally at 10, for
processing information. In the depicted embodiment,
processor 10 comprises a single integrated circuit
superscalar microprocessor. Accordingly, as discussed
further below, processor 10 includes various execution
units, registers, buffers, memories, and other functional
units, which are all formed by integrated circuitry.
Processor 10 may comprise one of the PowerPC™ line of
microprocessors produced by International Business
Machines Corporation which operates according to
reduced instruction set computing (RISC) techniques.

As depicted in Figure 1, processor 10 is coupled to
system bus 11 via a bus interface unit (BIU) 12 within
processor 10. BIU 12 controls the transfer of information
between processor 10 and other devices coupled to
system bus 11, such as a main memory (not illustrated).
Processor 10, system bus 11, and the other devices
coupled to system bus 11 together form a host data
processing system. BIU 12 is connected to instruction
cache 14 and data cache 16 within processor 10. High
speed caches, such as instruction cache 14 and data
cache 16, enable processor 10 to achieve relatively fast
access time to a subset of data or instructions previously
transferred from main memory to the high speed caches,
thus improving the speed of operation of the host
data processing system. Instruction cache 14 is further
coupled to sequential fetcher 17, which fetches instructions
from instruction cache 14 during each cycle for execution.
Sequential fetcher 17 transfers branch instructions
to branch processing unit (BPU) 18 for execution,
and transfers sequential instructions to instruction
queue 19 for temporary storage before being executed
by other execution circuitry within processor 10.

In the depicted embodiment, in addition to BPU 18, 
the execution circuitry of processor 10 comprises multiple
execution units, including fixed-point unit (FXU) 22,
load/store unit (LSU) 28, and floating-point unit (FPU) 
30. As is well-known to those skilled in the computer art, 
each of execution units 22, 28, and 30 executes one or 
more instructions within a particular class of sequential 
instructions during each processor cycle. For example, 
FXU 22 performs fixed-point mathematical operations
such as addition, subtraction, ANDing, ORing, and 
XORing, utilizing source operands received from spec- 
ified general purpose registers (GPRs) 32 or GPR re- 
name buffers 33. Following the execution of a fixed- 
point instruction, FXU 22 outputs the data results of the 
instruction to GPR rename buffers 33, which provide 
temporary storage for the result data until the instruction
is completed by transferring the result data from GPR 
rename buffers 33 to one or more of GPRs 32. Con- 
versely, FPU 30 performs floating-point operations, 
such as floating-point multiplication and division, on 
source operands received from floating-point registers 
(FPRs) 36 or FPR rename buffers 37. FPU 30 outputs 
data resulting from the execution of floating-point in- 
structions to selected FPR rename buffers 37, which 
temporarily store the result data until the instructions are 
completed by transferring the result data from FPR re- 
name buffers 37 to selected FPRs 36. LSU 28 executes 
floating-point and fixed-point instructions that either 
load data from memory (i.e., either data cache 16 or 
main memory) into selected GPRs 32 or FPRs 36, or 
that store data from a selected one of GPRs 32, GPR 
rename buffers 33, FPRs 36, or FPR rename buffers 37 
to memory. 

Processor 10 employs both pipelining and out-of- 
order execution of instructions to further improve the 
performance of its superscalar architecture. According- 
ly, instructions can be executed by FXU 22, LSU 28, and 
FPU 30 in any order as long as data dependencies are 
observed. In addition, instructions are processed by 
each of FXU 22, LSU 28, and FPU 30 at a sequence of 
pipeline stages. As is typical of high-performance proc- 
essors, each instruction is processed at five distinct 
pipeline stages, namely, fetch, decode/dispatch, exe- 
cute, finish, and completion. 

During the fetch stage, sequential fetcher 17 retrieves
one or more instructions associated with one or
more memory addresses from instruction cache 14.
Sequential instructions fetched from instruction cache 14
are stored by sequential fetcher 17 within instruction
queue 19. Fetched branch instructions are removed
from the instruction stream and are forwarded to BPU
18 for execution. BPU 18 includes a branch prediction
mechanism, such as a branch history table, that enables
BPU 18 to speculatively execute unresolved conditional
branch instructions by predicting whether the branch will
be taken.

During the decode/dispatch stage, dispatch unit 20
decodes and dispatches one or more instructions from
instruction queue 19 to the appropriate ones of execution
units 22, 28, and 30. Also during the decode/dispatch
stage, dispatch unit 20 allocates a rename buffer
within GPR rename buffers 33 or FPR rename buffers
37 for each dispatched instruction's result data. According
to a preferred embodiment of the present invention,
processor 10 dispatches instructions in program order
and tracks the program order of the dispatched instructions
during out-of-order execution utilizing unique
instruction identifiers. In addition to an instruction identifier,
each instruction within the execution pipeline of processor
10 has an rA tag and an rB tag, which indicate the
sources of the A and B operands for the instruction, and
an rD tag that indicates a destination rename buffer within
GPR rename buffers 33 or FPR rename buffers 37 for
the result data of the instruction.

During the execute stage, execution units 22, 28,
and 30 execute instructions received from dispatch unit
20 opportunistically as operands and execution resources
for the indicated operations are available. After execution
has finished, execution units 22, 28, and 30 store
result data within either GPR rename buffers 33 or FPR
rename buffers 37, depending upon the instruction type.
Then, execution units 22, 28, and 30 notify completion
unit 40 which instructions have finished execution. Finally,
instructions are completed by completion unit 40
in program order by transferring result data from GPR
rename buffers 33 and FPR rename buffers 37 to GPRs
32 and FPRs 36, respectively.

Processor 10 is capable of executing multi-register
load/store instructions, which load and store data in a
plurality of general purpose registers from and to memory.
In particular, in the preferred embodiment having a
PowerPC™ microprocessor, microprocessor 10 will execute
a load multiple instruction (lmw), which loads multiple
words from memory, and a store multiple instruction
(stmw), which stores multiple words to memory.

These multi-register instructions are retrieved from
instruction cache 14 by sequential fetcher 17 and loaded
into instruction queue 19. Upon dispatch of a multi-register
instruction by dispatch unit 20, LSU 28 will begin
execution of the multi-register instruction. Also, upon
dispatch of the instruction, a number of registers in GPR
32 identified in the multi-register instruction are allocated
to the instruction.

In the preferred embodiment, the load multiple instruction
requires that up to 32 contiguous registers be
loaded with up to 32 contiguous words from memory.
For example, the instruction "lmw r3, r2, r1" will load registers
3-31 with data found at location <r2 + r1> in memory.
Thus, for this example, the first register to be loaded
will be register 3 (r3). LSU 28 will then proceed to load
register 4, register 5, and so on until all the registers up
to and including register 31 have been loaded. At that
point the load multiple instruction has finished execution.
This is reported to completion unit 40, which completes
the instruction by committing it to architected registers
in the system.
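
The register-by-register behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not the circuitry of LSU 28: the function name, the dictionary used as memory, and the assumption of 4-byte words at an already computed address are all choices made for the example.

```python
def load_multiple_word(gprs, memory, first_reg, base_addr):
    """Load gprs[first_reg..31] from consecutive words starting at base_addr.

    Assumed model: gprs is a list of 32 integers, memory maps a byte
    address to one 32-bit word, and base_addr is the already computed
    effective address (<r2 + r1> in the text's example).
    Returns the order in which the registers were loaded.
    """
    order = []
    addr = base_addr
    for reg in range(first_reg, 32):
        gprs[reg] = memory[addr]  # one register is loaded per step
        order.append(reg)
        addr += 4                 # next contiguous word
    return order


# Example mirroring "lmw r3, r2, r1": registers 3 through 31 are loaded.
gprs = [0] * 32
memory = {0x100 + 4 * i: 1000 + i for i in range(29)}
order = load_multiple_word(gprs, memory, 3, 0x100)
```

The returned order starts at register 3 and ends at register 31, matching the sequence in which the text says the registers are filled.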

Referring to Figure 2, there is shown a timing diagram
of the cycles required to handle the load multiple
instruction and subsequent fixed-point instructions. The
load multiple instruction (Load Mult) is fetched (F) from
instruction cache 14 by sequential fetcher 17 during cycle 1.
The instruction is decoded (Dec) during cycle 2
and dispatched (Disp) by dispatch unit 20 to LSU 28 during
cycle 3. LSU 28 executes (E) the load multiple instruction
during cycles 4-7, and the instruction is completed
(C) by completion unit 40 during cycle 8.

In this example, four general purpose registers are
loaded. In a preferred embodiment, the load multiple instruction
would be formatted as lmw r28, r2, r1. This instruction
would load registers 28-31, and, as seen in
Figure 2, one register is loaded per system clock cycle
for cycles 4-7.

In the prior art, any fixed-point instructions subse- 
quent to the load multiple instruction would be serialized 
so that they would not be fetched until after the load multiple
instruction completed. This enables coherency of
the operand data utilized by the subsequent fixed-point 
instructions to be maintained. Therefore, in the example 
of Figure 2, any subsequent fixed-point instructions 
could not be fetched until cycle 9 in the prior art. 

According to the present invention, a deallocation
mechanism is provided that allows fixed-point instructions
waiting in the instruction buffer to be dispatched
prior to completion of the load multiple instruction. As
seen in Figure 2, prior to cycle 4, registers 28-31 are
allocated to the load multiple instruction. However, once
register 28, for example, has been loaded by the load
multiple instruction, this resource is deallocated to allow
its contents to be used as operand data for subsequent
instructions. As a consequence, subsequent fixed-point
instructions that are dependent upon the results of the
load multiple instruction can be dispatched to other functional
units prior to the completion of the load multiple
instruction.

Processor 10 maintains a scoreboard or table for all
general purpose registers (GPR) 32 that lists each register
and indicates when the load multiple instruction has
loaded that associated register. Dispatch unit 20 accesses
the scoreboard to determine whether subsequent
instructions can be dispatched. For example, as
seen in Figure 2, consider how to handle a load multiple
instruction (Load Mult) followed by a first and second
fixed-point instruction (FX Inst 1 and FX Inst 2), as represented
by the following instruction sequence:



lmw r28, r2, r1
add r2, r2, r28
add r3, r3, r30

(Note: the "adds" are fixed-point instructions to add the contents
of a first register to the contents of a second register and store
the result operand in the first register)

As can be seen in Figure 2, FX Inst 1 can be dis- 
patched as soon as register 28 is released by the load/ 
store unit to the scoreboard. During cycle 4 the load/ 
store unit has executed the load multiple instruction for 
register 28, so this register will be deallocated on the 
scoreboard. In the next subsequent cycle, dispatch unit 
20 will dispatch FX Inst 1 to FXU 22 because the source 
operand data for this instruction is now available in the 
general purpose registers. This instruction is executed 
by FXU 22 during cycle 6, but completion unit 40 does 
not complete the instruction until cycle 9 in order to guar- 
antee coherency of register data. As can be seen from 
Figure 2, FX Inst 2 is fetched and decoded during cycles 
3 and 4. However, dispatch unit 20 does not dispatch 
this instruction to FXU 22 until after the load multiple 
instruction has loaded register 30 during cycle 6 and has
indicated this load on the scoreboard. Dispatch unit 20
reads the deallocation of register 30 and dispatches the 
second fixed-point instruction during cycle 7, and FXU 
22 executes the instruction during cycle 8. Completion 
unit 40 does not complete this instruction until cycle 10 
because all the fixed-point instructions must be complet- 
ed in programming sequence. 
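
The cycle numbers in the Figure 2 walkthrough can be reproduced with a small scheduling sketch. It is a simplified model built on assumptions read off the description rather than stated as rules in it: one register is loaded per cycle, a dependent instruction is dispatched in the cycle after the scoreboard marks its source register as loaded, execution follows dispatch by one cycle, and instructions complete in program order at one per cycle. The function name and parameters are illustrative.

```python
def schedule_after_load_multiple(first_reg, exec_start, dependents):
    """Return (lmw_completion_cycle, [(dispatch, execute, complete), ...]).

    dependents lists, in program order, the register loaded by the load
    multiple that each later fixed-point instruction reads.
    """
    # Scoreboard: cycle in which each register is marked as loaded.
    loaded_in = {reg: exec_start + i
                 for i, reg in enumerate(range(first_reg, 32))}
    lmw_complete = max(loaded_in.values()) + 1

    timeline = []
    prev_dispatch = exec_start - 1   # cycle in which the lmw was dispatched
    prev_complete = lmw_complete
    for src in dependents:
        # Dispatch once the scoreboard shows the source as loaded, and no
        # earlier than one cycle after the previous dispatch.
        dispatch = max(loaded_in[src] + 1, prev_dispatch + 1)
        execute = dispatch + 1
        # In-order completion behind the previous instruction.
        complete = max(execute + 1, prev_complete + 1)
        timeline.append((dispatch, execute, complete))
        prev_dispatch, prev_complete = dispatch, complete
    return lmw_complete, timeline


# lmw r28 executes from cycle 4; FX Inst 1 reads r28, FX Inst 2 reads r30.
lmw_done, fx = schedule_after_load_multiple(28, 4, [28, 30])
print(lmw_done, fx)  # 8 [(5, 6, 9), (7, 8, 10)]
```

Under these assumptions FX Inst 1 is dispatched in cycle 5, executed in cycle 6 and completed in cycle 9, and FX Inst 2 is dispatched in cycle 7, executed in cycle 8 and completed in cycle 10, in agreement with the text.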

As shown in this example, there has been an im- 
provement of up to eight clock cycles to the dispatch 
and execution of the subsequent fixed-point instructions 
(assuming single ported register files and caches). This 
is a substantial increase in processor efficiency over the 
prior art. In fact, in certain examples, an improvement 
of up to 32 clock cycles could be realized. As will be 
appreciated, this enhanced performance is only limited 
by the depth of the completion buffer in the completion 
unit as to how many instructions can be in the execute 
pipeline before the load-multiple completes. 

Early deallocation of resources is also accom- 
plished during the execution of a store multiple instruc- 
tion. In the preferred embodiment, the store multiple op- 
eration requires that up to 32 contiguous registers be 
stored to up to 32 contiguous word locations in memory. 
For example, the store multiple instruction stmw r3, r2, 
r1 will store the contents of register 3 - register 31 to the 
memory located at <r2 + r1>. According to the present 
invention, upon dispatch of the store multiple instruction, 
additional subsequent fixed-point instructions can be 
dispatched to the other fixed-point execution units in the 
microprocessor, unconditionally. As before, these in- 
structions must complete in programming sequence, 
but execution may begin immediately after dispatch of 
these instructions. 

In the present invention, it is recognized that the 
store multiple instruction does not have to be serialized 
with subsequent fixed-point instructions. Consequently, 
the entire contiguous set of registers required for the
store multiple instruction does not have to be allocated 
exclusively to that instruction, and instead can be used 
as a source operand resource for subsequent instruc- 
tions. However, any results of the subsequent instruc- 
tions must be stored in GPR rename buffers 33 until the 
store multiple instruction completes to prevent the writ- 
ing of registers which have not yet been stored. Upon 
completion of the store multiple instruction, the subse- 
quent instructions can be completed by transferring the 
result operands from GPR rename buffers 33 to an ar- 
chitected register in GPR 32. 
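
The role of the rename buffers during a store multiple can be illustrated with a minimal Python sketch. The class and method names are hypothetical, and the model is deliberately simplified: sources are read only from the architected registers (no operand forwarding from rename buffers), and the whole store multiple is performed in one call.

```python
class RenamedRegisterFile:
    """Architected GPRs plus rename buffers holding uncommitted results."""

    def __init__(self):
        self.gprs = [0] * 32
        self.rename = []  # pending (dest_reg, value) pairs in program order

    def execute_add(self, dest, src_a, src_b):
        # Read the architected sources and park the result in a rename
        # buffer instead of overwriting the destination register.
        self.rename.append((dest, self.gprs[src_a] + self.gprs[src_b]))

    def store_multiple(self, first_reg):
        # The store multiple reads architected registers only, so it never
        # sees results that are still waiting in the rename buffers.
        return [self.gprs[r] for r in range(first_reg, 32)]

    def complete_pending(self):
        # Once the store multiple has completed, commit in program order.
        for dest, value in self.rename:
            self.gprs[dest] = value
        self.rename.clear()


rf = RenamedRegisterFile()
rf.gprs[28], rf.gprs[29] = 5, 7
rf.execute_add(29, 29, 28)       # executes early, result held back
stored = rf.store_multiple(28)   # still stores the old value of r29
rf.complete_pending()            # r29 is updated only now
```

In this example the add targets register 29, which lies inside the range being stored, so it shows why the result must wait: the store multiple writes the original value 7 to memory, and register 29 receives 12 only after the pending result is committed.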

Referring now to Figure 3, there is shown a timing 
diagram of the cycles in which a store multiple instruc- 
tion (Store Mult) and two fixed-point instructions (FX 
Inst3 and FX Inst4) are handled in the microprocessor, 
in accordance with the preferred embodiment of the 
present invention. As an example, consider the handling 
of the following instruction sequence: 
stmw r28, r2, r1 
add r2, r2, r28 
add r3, r3, r30 
As can be seen in Figure 3, the store multiple in- 
struction is fetched during cycle 1, decoded in cycle 2, 
dispatched in cycle 3, executed by the load/store unit 
28 during cycles 4-7, and is completed in cycle 8. In accordance
with the present invention, FX Inst1 and FX
Inst2 are dispatched as soon as possible after the prior
store multiple instruction has been dispatched in cycle
3. FX Inst1 is dispatched during cycle 4, and because
only one instruction may be fetched per cycle, FX Inst2 
is dispatched during cycle 5. These fixed-point instruc- 
tions may execute immediately, as is done in cycles 5 
and 6, because the operand data required for execution 
is already present in registers 28 and 30, regardless of 
the progress of the store multiple instruction execution. 
In fact, during cycle 6, LSU 28 will store the data from 
register 30 out to memory and FXU 22 will add the op- 
erand data contained in register 30 to the operand data 
contained in register 3. The results of the fixed-point in- 
structions 1 and 2 are held in rename buffers 33 until 
cycles 9 and 10, respectively, at which time the result 
operands are stored to registers 2 and 3, respectively. 
As has been explained, these fixed-point instructions do 
not complete until after the store multiple instruction has 
completed in order to maintain resource coherency. 
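
The Figure 3 timing can be checked with a short sketch along the same lines. The assumptions are inferred from the description and are not stated as rules in it: following instructions are dispatched unconditionally at one per cycle after the store multiple, each executes in the cycle after its dispatch, the store multiple stores one register per cycle, and completion is in program order at one instruction per cycle. The function name and parameters are illustrative.

```python
def schedule_after_store_multiple(dispatch_cycle, num_regs, num_followers):
    """Return (stmw_completion_cycle, [(dispatch, execute, complete), ...])."""
    # The store multiple executes for num_regs cycles after its dispatch
    # and completes in the following cycle.
    stmw_complete = dispatch_cycle + num_regs + 1

    timeline = []
    prev_complete = stmw_complete
    for i in range(1, num_followers + 1):
        dispatch = dispatch_cycle + i   # unconditional, one per cycle
        execute = dispatch + 1          # sources are already in the GPRs
        # The result stays in a rename buffer until in-order completion.
        complete = max(execute + 1, prev_complete + 1)
        timeline.append((dispatch, execute, complete))
        prev_complete = complete
    return stmw_complete, timeline


# stmw r28 is dispatched in cycle 3 and stores four registers.
stmw_done, fx = schedule_after_store_multiple(3, 4, 2)
print(stmw_done, fx)  # 8 [(4, 5, 9), (5, 6, 10)]
```

The output matches the walkthrough: the two fixed-point instructions execute in cycles 5 and 6 while the store multiple is still running, and their results are written to the architected registers in cycles 9 and 10.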

In summary, the present invention addresses the 
substantial problem of the increased overhead associ- 
ated with serialized loads and stores. Such serialization 
of operations requires that the microprocessor's regis- 
ters be completely empty prior to dispatch of the serial- 
ized operation and that those resources remain allocat- 
ed for that serialized instruction until completion. In ac- 
cordance with the present invention, the microprocessor 
performance is substantially increased by not requiring 
serialization and by allowing additional subsequent in- 
structions to be executed simultaneously with load and 
store multiple instructions. Depending upon the microprocessor's
completion buffer, a significant number of
additional instructions may be executed by the microprocessor
during the execution of the multi-register instructions.
For example, in the preferred embodiment,
with a completion buffer depth of five registers, there is
a potential of up to four additional instructions completing
without a pipeline stall. Because such load and store
multiple instructions can take up to 36 cycles to complete,
this provides substantial time savings to enhance
microprocessor speed and efficiency.



Claims 

1. A method of executing multiple instructions in a
superscalar microprocessor, including at least one
load multiple instruction that loads to more than one
register of a plurality of registers, comprising the
steps of:

dispatching a load multiple instruction to a load/
store unit, wherein the load/store unit begins
execution of a dispatched load multiple instruction,
and wherein the load multiple instruction
loads data from memory into a plurality of registers;
maintaining a table that lists each register of the
plurality of registers and that indicates when
data has been loaded into each register by the
executing load multiple instruction; and
executing an instruction that is dependent upon
source operand data loaded by the load multiple
instruction into a register of the plurality of
registers indicated by the instruction as a
source register, prior to the load multiple instruction
completing its execution, when the table
indicates the source operand data has been
loaded into the source register.

2. The method of claim 1, wherein said dependent
instruction does not complete until after said load
multiple instruction has completed.

3. A method of executing multiple instructions in a
superscalar microprocessor, including at least one
store multiple instruction that stores data from more
than one register of a plurality of registers to memory,
comprising the steps of:

dispatching a store multiple instruction to a
load/store unit, whereupon the load/store unit
begins executing the store multiple instruction,
wherein the store multiple instruction stores data
from a plurality of registers to memory; and
executing a fixed point instruction that is dependent
upon data being stored by the store
multiple instruction from a register of the plurality
of registers indicated by the fixed point instruction
as a source register, prior to the store
multiple instruction completing its execution,
but prohibiting the executing fixed point instruction
from writing to a register of the plurality of
registers prior to the store multiple instruction
completing.

4. The method of claim 3, wherein said dependent instruction does not complete until after said store multiple instruction has completed.
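
By way of illustration only, the write restriction of claims 3 and 4 can be sketched as follows. The class `StoreMultipleGuard`, its holding buffer `held_results`, and the one-register-per-step store timing are assumptions made for the sketch; the claims require only that the fixed point instruction be prohibited from writing to the registers before the store multiple completes, and do not say how results are held.

```python
class StoreMultipleGuard:
    """Hypothetical model: fixed point results are held until the store multiple finishes."""

    def __init__(self, registers, store_sources):
        self.registers = dict(registers)           # register number -> value
        self.pending_store = list(store_sources)   # registers still to be stored
        self.memory = []                           # values stored so far, in order
        self.held_results = []                     # (dest, value) awaiting write-back

    def store_step(self):
        # Assume (for illustration) that one register is stored per step.
        reg = self.pending_store.pop(0)
        self.memory.append(self.registers[reg])
        if not self.pending_store:
            self._write_back_held_results()

    def execute_add(self, dest, src_a, src_b):
        # Reading sources is always allowed; the store multiple only reads them too.
        value = self.registers[src_a] + self.registers[src_b]
        if self.pending_store:
            # Store multiple still executing: do not write the register yet.
            self.held_results.append((dest, value))
        else:
            self.registers[dest] = value
        return value

    def _write_back_held_results(self):
        for dest, value in self.held_results:
            self.registers[dest] = value
        self.held_results.clear()


if __name__ == "__main__":
    guard = StoreMultipleGuard({1: 10, 2: 20, 3: 5}, [1, 2, 3])
    guard.store_step()                 # r1 stored
    guard.execute_add(3, 1, 2)         # computes 30, but r3 is not yet written
    guard.store_step()                 # r2 stored
    guard.store_step()                 # r3 stored with its old value, then write-back
    print(guard.memory, guard.registers[3])
```

The sketch stores the old value of r3 to memory even though the add executed earlier, because the write to r3 is deferred until the store multiple finishes. It does not model later instructions that read a held result.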

5. A superscalar microprocessor (10) having a plurality of execution units (22, 28, 30) for simultaneously executing a plurality of instructions and being connected to a memory, comprising:

a plurality of registers (32, 33, 36, 37) for selectively storing data resulting from execution by the execution units;

a load/store execution unit (28) that executes load and store instructions which load or store data to or from the plurality of registers;

a fixed point execution unit (22) for performing fixed point operations on operand data stored in source registers of the plurality of registers; and

a dispatcher (20) that maintains a table listing at least some of the plurality of registers and indicating when data has been loaded into each register by a load multiple instruction executing in the load/store execution unit, wherein a load multiple instruction loads data from memory into more than one of the plurality of registers, and wherein the dispatcher dispatches instructions to the plurality of execution units, including the load/store execution unit and the fixed point execution unit, and further wherein the dispatcher dispatches a load multiple instruction to the load/store unit, which begins execution of the load multiple instruction and further, prior to the load multiple instruction finishing its execution, dispatches a fixed point instruction which is dependent upon source operand data loaded by the load multiple instruction into a register of the plurality of registers indicated by the instruction as a source register, when the table indicates the source operand data has been loaded into the source register by the execution of the load multiple instruction in the load/store execution unit.

6. The microprocessor of claim 5, wherein said fixed point instruction does not complete until after said load multiple instruction has completed.



7. The microprocessor of claim 5 or 6, wherein the dispatcher dispatches a store multiple instruction to the load/store unit, which begins execution of the store multiple instruction, wherein a store multiple instruction stores data from more than one of the plurality of registers to memory, and further, prior to the store multiple instruction finishing, dispatches a fixed point instruction, which is dependent upon source operand data stored in a register of the plurality of registers that is being stored out to memory by the executing store multiple instruction, to the fixed point execution unit, wherein the fixed point execution unit executes the dispatched fixed point instruction prior to the store multiple instruction finishing, but does not store the result of the executed fixed point instruction to a register of the plurality of registers prior to the store multiple instruction finishing.

8. A superscalar microprocessor (10) having a plurality of execution units (22, 28, 30) for simultaneously executing a plurality of instructions and being connected to a memory, comprising:

a plurality of registers (32, 33, 36, 37) for selectively storing data resulting from execution by the execution units;

a load/store execution unit (28) that executes load and store instructions which load or store data to or from the plurality of registers;

a fixed point execution unit (22) for performing fixed point operations on operand data stored in source registers of the plurality of registers;

a dispatcher (20) that dispatches instructions to the plurality of execution units, including the load/store execution unit and the fixed point execution unit, wherein the dispatcher dispatches a store multiple instruction to the load/store unit, which begins execution of the store multiple instruction, wherein a store multiple instruction stores data from more than one of the plurality of registers to memory, and further, prior to the store multiple instruction finishing, dispatches a fixed point instruction, which is dependent upon source operand data stored in a register of the plurality of registers that is being stored out to memory by the executing store multiple instruction, to the fixed point execution unit, wherein the fixed point execution unit executes the dispatched fixed point instruction prior to the store multiple instruction finishing, but does not store the result of the executed fixed point instruction to a register of the plurality of registers prior to the store multiple instruction finishing.

9. The microprocessor of claim 8, wherein said fixed point instruction does not complete until after said store multiple instruction has completed.